Introduction

The NBA stands for the National Basketball Association, a professional basketball league in North America. Over its long history, there have been many talented basketball players, some of the best in the world.

Last year (2022) marked the 75th anniversary of the NBA, and the 75 greatest players of all time were announced:

A picture of the NBA 75th Anniversary Team: NBA 75 Anniversary Team.

Although all 75 of them are great, there has always been a debate on who is the “greatest”. Most people argue that Michael Jordan is the greatest player of all time:

However, some believe LeBron James is the greatest of all time:

This debate will probably go on forever without a definite answer. Another popular debate from the anniversary team is who was “snubbed” from the greatest 75. One of my favorite players of all time, Penny Hardaway, did not make the team. Penny had a great start to his career; however, his prime was short-lived because of injuries.

Therefore, the goal of this project is to examine, from a data science perspective, which factors can help determine whether a current or retired player will be regarded as one of the greatest ever the next time the NBA announces an anniversary team.

This project includes 5 different models, built from the training set. Their performances are individually evaluated using different metrics, and the final models are determined from this process.

Initial Data Processing

The data set is obtained from Kaggle. The link is provided in the reference section.

The data set includes some common information about each player: height, weight, how many years they played in the NBA, and so on. Some of these are valuable predictors for the goal of this project.

This section is dedicated to importing the data and modifying the raw data to make it suitable for model building.

library(readr)       # Read data
library(dplyr)       # Data manipulation
library(tidyr)       # drop_na()
library(naniar)      # vis_miss()
library(tidymodels)  # Modeling framework
library(corrr)       # correlate()/rplot()
library(themis)      # step_upsample()
library(vip)         # Variable importance plots
data <- read_csv("common_player_info.csv")
df <- data[1:3000, c('height','weight','season_exp','position','school', 'country', 'draft_round','greatest_75_flag')]

# convert the height data from feet-inches (e.g. "6-7") to centimeters
df$height <- 2.54*(12* as.numeric(sub("\\-.*", "", df$height)) + as.numeric(sub(".*\\-", "", df$height)))

# Inspect Missing Data 
vis_miss(df)

df <- df %>% drop_na()

# modify the position data 
# coded each position into a number from 1 to 5
for ( x in 1:nrow(df)) {
  if(df$position[x] == "Forward")
    df$position[x] = 3
  else if (df$position[x] == "Guard" || df$position[x] == "Guard-Forward")
    df$position[x] = 1
  else if (df$position[x] == "Center")
    df$position[x] = 5
  else if (df$position[x] == "Forward-Guard")
    df$position[x] = 2
  else if (df$position[x] == "Forward-Center" || df$position[x] == "Center-Forward")
    df$position[x] = 4
}

# modify the country data 
# Code country into USA and International
for ( x in 1:nrow(df)) {
  if(df$country[x] != "USA")
    df$country[x] = "International"
}

# modify draft data. Undrafted players set to be round "0"
for ( x in 1:nrow(df)) {
  if(df$draft_round[x] == "Undrafted")
    df$draft_round[x] = 0
}

# modify draft round data: players drafted after round 2 are grouped with the undrafted (round 0)
df$draft_round <- as.numeric(df$draft_round )

for ( x in 1:nrow(df)) {
  if(df$draft_round[x] > 2)
    df$draft_round[x] = 0
}

First, checking the amount of missing data shows that only 1.7 percent of the values are missing. Therefore, it’s safe to remove the observations with missing data.

Then the height data is converted to centimeters. The position of a player is coded into numbers from 1 to 5. The convention in the world of basketball is that 1 is a point guard, 2 is a shooting guard, 3 is a small forward, 4 is a power forward, and 5 is a center. This project proceeds with this convention.

It’s also common in the basketball world to regard a player as “domestic” or “international” due to the vast differences in playing style between US and international players. Besides the US, most other countries have only one or two players in the NBA, so players are grouped as “domestic” or “international” for this project.

For the undrafted players, the draft round is set to 0. Since the NBA reduced the draft to 2 rounds in recent years, players drafted after round 2 are also grouped into round 0 in the code above.
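The three recoding loops above are correct but can be written more compactly with vectorized dplyr verbs. A sketch of an equivalent version (assuming `df` as defined above; note that `case_when()` returns `NA` for any position label not listed, whereas the loop leaves it unchanged):

```r
library(dplyr)

df <- df %>%
  mutate(
    # positions coded 1-5 following the basketball convention
    position = case_when(
      position %in% c("Guard", "Guard-Forward")           ~ "1",
      position == "Forward-Guard"                         ~ "2",
      position == "Forward"                               ~ "3",
      position %in% c("Forward-Center", "Center-Forward") ~ "4",
      position == "Center"                                ~ "5"
    ),
    # everything outside the USA becomes "International"
    country = if_else(country == "USA", "USA", "International"),
    # undrafted players and rounds past 2 are grouped into round 0
    draft_round = if_else(draft_round == "Undrafted", "0", draft_round),
    draft_round = as.numeric(draft_round),
    draft_round = if_else(draft_round > 2, 0, draft_round)
  )
```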

Data Split

The data is split into training and testing sets in this section. The ratio of the training data size to the testing data size is set to 7:3. The split is stratified by the factor “greatest_75_flag”.

To provide some context, the training data is what’s used to build the models and the testing data is used to validate those models.

# Split the data set
set.seed(2)
df_split <- initial_split(df, strata = greatest_75_flag, prop = 0.7) 
df_split
## <Training/Testing/Total>
## <1926/826/2752>
# Define the training and testing data.
nba_train <- training(df_split)
nba_test <- testing(df_split)

The following section converts the draft round, country, and position values (along with the response, greatest_75_flag) to factors:

# draft round, country and position are all factor variables
# (the df-level conversion below is used for the EDA plots)

df$greatest_75_flag <- factor(df$greatest_75_flag, levels = c("N", "Y"))

nba_train$greatest_75_flag <- as.factor(nba_train$greatest_75_flag)
nba_test$greatest_75_flag <- as.factor(nba_test$greatest_75_flag)


df<-df %>% mutate(draft_round = factor(draft_round), country = factor(country), position = factor(position), greatest_75_flag = factor(greatest_75_flag))

Now the data set is ready for model fitting, but first, some EDA plots can help give a sense of the relationships between the variables:

Exploratory data analysis

In this section, some EDA is performed, including two bar charts, a correlation plot, and a density curve.

Then a table is provided for the number of top 75 players in each data set.

# Bar chart of the split based on if a player is top 75
df %>% 
  ggplot(aes(x = greatest_75_flag)) + geom_bar(color='yellow', fill ='lightblue') +
  labs(title = "Bar Chart")

# Correlation plot (numeric variables only)
cor_lab <- df %>% select(where(is.numeric)) %>% correlate()
rplot(cor_lab) + 
labs(title = "Correlation Plot")

# Density Curve based on the number of years a player spent in the NBA
ggplot(df, aes(x = season_exp,fill = greatest_75_flag)) + 
  geom_density(alpha=0.4) +
  scale_fill_manual(values = c("#FF9933", "#33CC00", "#999999"))  + 
  labs(title = "Density Plot (Season Experience)")

# Bar chart based on the position
ggplot(df, aes(x = position,fill = greatest_75_flag)) + 
 geom_bar() + 
 scale_fill_manual(values = c('plum2', "seagreen", "#999999")) +
 labs(title = "Bar Chart (Position)")

# Table for greatest 75 flag
train_table <- table(nba_train$greatest_75_flag)
test_table <- table(nba_test$greatest_75_flag)
results_Y <- c(37, 8)
results_N <- c(1889, 818)
results_label <- c("Training Set", "Testing Set")
results_table <- cbind(results_label ,results_Y, results_N)
knitr::kable(results_table,'pipe', col.names = c('',"Greatest 75", "Not greatest 75"))
Greatest 75 Not greatest 75
Training Set 37 1889
Testing Set 8 818

Upon examining the bar chart, the number of players who are not among the greatest of all time is overwhelmingly larger than the number who are. This is expected, and upsampling should be considered.

From the correlation plot, height and weight are strongly positively correlated, which is expected. The number of years a player spends in the NBA has little correlation with his height and weight.

Upon examining the density plot (season experience), most of the greatest 75 players played more than 10 years in the NBA, and no player who spent less than 5 years in the NBA is regarded as one of the greatest. This result makes sense since, in general, the more years a player can play, the better that player is.

Upon examining the bar chart (position), a player who is a point guard, a small forward, or a center has a better chance of being in the top 75.

Now that there is a better sense of the data, the project proceeds with model fitting.

Define Recipe. K-Fold CV

This short section is dedicated to defining a recipe and creating folds to prepare for cross-validation:

Define Recipe

The recipe is defined based on the relevant numeric and factor variables prepared in the previous sections.

# Recipe
nba_recipe <- recipe (greatest_75_flag ~ height + weight + season_exp + position + country + draft_round, data = nba_train) %>% 
  step_dummy(position) %>%
  step_dummy(country) %>%
  step_upsample(greatest_75_flag, over_ratio = 1, skip = TRUE)

prep(nba_recipe) %>% bake(new_data = nba_train) %>%
  group_by(greatest_75_flag)
## # A tibble: 1,926 × 10
## # Groups:   greatest_75_flag [2]
##    height weight season_exp draft_round greatest_75_flag position_X2 position_X3
##     <dbl>  <dbl>      <dbl>       <dbl> <fct>                  <dbl>       <dbl>
##  1   208.    240          5           1 N                          0           1
##  2   206.    235         10           1 N                          0           0
##  3   218.    225         20           1 Y                          0           0
##  4   185.    162          9           1 N                          0           0
##  5   198.    235          7           1 N                          1           0
##  6   206.    245         13           1 N                          0           1
##  7   190.    195          2           0 N                          0           1
##  8   198.    200          3           2 N                          0           0
##  9   203.    225          2           1 N                          0           1
## 10   196.    185          3           2 N                          0           0
## # ℹ 1,916 more rows
## # ℹ 3 more variables: position_X4 <dbl>, position_X5 <dbl>, country_USA <dbl>

Define folds for CV: 10-fold CV

This section defines a 10-fold CV, and the folds are stratified by “greatest_75_flag”.

To provide some context, 10-fold CV fits the model on 10 different folds of the data, which produces a more reliable estimate of performance than a single split.

nba_folds <- vfold_cv(nba_train, v = 10, strata = greatest_75_flag)

Model Fitting

In this section, 5 different models are constructed: logistic regression, K nearest neighbors, elastic net regression, random forest, and gradient-boosted tree.

Logistic Regression

In this section, a logistic regression model is fitted:

# Define glm engine 
log_model <- logistic_reg() %>%
  set_engine("glm") %>% 
  set_mode("classification") 

# Define logistic regression Workflow
log_wf <- workflow() %>% 
  add_model(log_model) %>% 
  add_recipe(nba_recipe)

# Fit the model for each fold
tune_res_log <- tune_grid(
  object = log_wf,
  resamples = nba_folds
)

# Select the best model from Logistic Regression
best_log <- show_best(tune_res_log, metric = "roc_auc")
best_log
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.839     9  0.0538 Preprocessor1_Model1
# Final fit for Logistic Regression. 
final_log_model <- finalize_workflow(log_wf,best_log)
final_log_model <- fit(final_log_model, nba_train)

KNN

In this section, a KNN model is fitted and the best model is selected (neighbors is a tuning parameter):

library(kknn)
set.seed(2)

# Define KNN model
knn_model <- nearest_neighbor(neighbors = tune()) %>% # neighbors as tuning parameter
  set_engine("kknn") %>% 
  set_mode("classification") 

# Define KNN workflow
knn_wf <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(nba_recipe)

# Define neighbors grid
neighbors_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 20)

# Fit the model for each fold
tune_res_knn <- tune_grid(
  object = knn_wf, 
  resamples = nba_folds, 
  grid = neighbors_grid
)

# Select the best model from KNN
best_neighbors <- select_by_one_std_err(tune_res_knn, metric = "roc_auc", neighbors)
best_neighbors
## # A tibble: 1 × 9
##   neighbors .metric .estimator  mean     n std_err .config          .best .bound
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            <dbl>  <dbl>
## 1         9 roc_auc binary     0.752     9  0.0494 Preprocessor1_M… 0.775  0.736
# Final fit for KNN. 
knn_final_wf <- finalize_workflow(knn_wf, best_neighbors)
knn_final_fit <- fit(knn_final_wf, nba_train)

From the output, 9 neighbors produces the best result given the metric of area under the ROC curve (within one standard error).

Elastic Net Regression

In this section, an elastic net model is fitted and the best model is selected (mixture and penalty are tuning parameters):

set.seed(2)
# Define the ENR model/ Set up tuning parameters
enr_model <- logistic_reg(mixture = tune(), 
                          penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# Define ENR workflow
enr_wf <- workflow() %>%
  add_recipe(nba_recipe) %>%
  add_model(enr_model)

# Define tuning grid for ENR
enr_grid <- grid_regular(penalty(range = c(0, 1),
                                     trans = identity_trans()),
                        mixture(range = c(0, 1)),
                             levels = 10)

# Fit the ENR models 
tune_res_enr <-  tune_grid(
  object = enr_wf,
  resamples = nba_folds,
  grid = enr_grid)

# Select the best parameters.
best_enr <- select_by_one_std_err(tune_res_enr, penalty,mixture, metric = "roc_auc")
best_enr
## # A tibble: 1 × 10
##   penalty mixture .metric .estimator  mean     n std_err .config    .best .bound
##     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>      <dbl>  <dbl>
## 1   0.111   0.222 roc_auc binary     0.924     9  0.0149 Preproces… 0.932  0.919
# Finalize the fit for ENR
enr_final_wf <- finalize_workflow(enr_wf, best_enr)
enr_final_fit <- fit(enr_final_wf, nba_train)

From the output, a penalty of 0.11 and a mixture of 0.22 produce the best result given the metric of area under the ROC curve.

Random Forest

In this section, a random forest model is fitted and the best model is selected (mtry, min_n, and trees are tuning parameters). To provide some context: mtry is the number of randomly selected predictors, trees is the number of trees, and min_n is the minimal node size.

# Define RF model
forest_model <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger",  importance = "impurity") %>% 
  set_mode("classification") %>% 
  translate()

# Define tuning grid for rf model
tree_grid <- grid_regular(mtry(range = c(1, 6)),levels = 6,
                        trees(range = c(400, 1800)),
                        min_n(range = c(10, 20)))

# Set up the work flow of random forest
tree_wf <- workflow() %>% 
  add_model(forest_model) %>%
  add_recipe(nba_recipe)
tune_tree <- tune_grid(
  object = tree_wf,
  resamples = nba_folds,
  grid = tree_grid
)

save(tune_tree, file = "nba_tune_tree.rda")
load("nba_tune_tree.rda")
autoplot(tune_tree)

# select best parameters for random forest
best_tree  <- select_best(tune_tree, metric = "roc_auc")
best_tree
## # A tibble: 1 × 4
##    mtry trees min_n .config               
##   <int> <int> <int> <chr>                 
## 1     4   960    20 Preprocessor1_Model196

From the plot, a larger minimal node size gives better accuracy. For area under the curve, a higher number of trees helps the performance.

From the output, an mtry of 4, 960 trees, and a min_n of 20 produce the best result given the metric of area under the ROC curve.

# Finalize random forest wf 
final_rf_model <- finalize_workflow(tree_wf,best_tree)
final_rf_model <- fit(final_rf_model, nba_train)
final_rf_model %>% extract_fit_parsnip() %>% 
  vip() +
  theme_light()

From the importance plot, the number of seasons a player spent in the NBA is the most important factor. Weight, height, and draft round also play a role. The center position also carries importance; the explanation could be that some of the best players in the history of the sport are centers, and it’s not until recent years that guards have been regarded as the best players.

Gradient-Boosted Tree Model

In this section, a gradient-boosted tree model is fitted and the best model is selected (mtry, trees, and learn_rate are tuning parameters).

# Define the gb model
bt_model <- boost_tree(mtry = tune(), 
                       trees = tune(),
                       learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("classification") 

# Define the tuning grid for gb model
bt_grid <- grid_regular(mtry(range = c(1, 6)),levels = 6,
                        trees(range = c(400, 1800)),
                        learn_rate())

# Set up the work flow of the boosted tree
bt_wf <- workflow() %>% 
  add_model(bt_model) %>%
  add_recipe(nba_recipe)
tune_bt <- tune_grid(
  bt_wf,
  resamples = nba_folds,
  grid = bt_grid
)
save(tune_bt, file = "nba_tune_bt.rda")
load("nba_tune_bt.rda")
autoplot(tune_bt) + theme_minimal()

best_bt <- select_best(tune_bt, metric = "roc_auc")
best_bt
## # A tibble: 1 × 4
##    mtry trees learn_rate .config               
##   <int> <int>      <dbl> <chr>                 
## 1     6   960  0.0000251 Preprocessor1_Model201

From the output, an mtry of 6, 960 trees, and a learn rate of 2.512e-05 produce the best result.

# Finalize the boosted tree wf 
final_bt_model <- finalize_workflow(bt_wf,best_bt)
final_bt_model <- fit(final_bt_model, nba_train)

Model Predictions/Performance Review

In this section, each model is fitted to the testing data. Then the performance of each model is assessed by a confusion matrix as well as the AUC and ROC plots.

Even though the AUC looks good, the actual predictions at the default 0.5 cutoff aren’t accurate because of the class imbalance. Therefore, for all models, the classification threshold is adjusted accordingly.
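The thresholds used in this section were chosen by inspecting the `roc_curve()` output for each model. As a hedged sketch, one common way to automate that choice (an assumption here, not the exact procedure used in this project) is to pick the threshold that maximizes Youden’s J statistic, sensitivity + specificity − 1:

```r
library(dplyr)
library(yardstick)

# Sketch: pick the probability cutoff maximizing Youden's J = sensitivity + specificity - 1.
# `preds` is an augmented prediction tibble such as final_log_model_test;
# event_level = "second" treats "Y" (the second factor level) as the positive class.
pick_threshold <- function(preds) {
  preds %>%
    roc_curve(greatest_75_flag, .pred_Y, event_level = "second") %>%
    filter(is.finite(.threshold)) %>%
    mutate(j = sensitivity + specificity - 1) %>%
    slice_max(j, n = 1, with_ties = FALSE) %>%
    pull(.threshold)
}
```

The cutoff returned by such a function could then replace the hard-coded values (e.g. 0.494051) used in the `if_else()` calls that follow.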

Logistic Predictions and Confusion matrix/ AUC/ AUC plot

final_log_model_test <- augment(final_log_model, 
                               nba_test)
# Assess model threshold 
final_log_model_test %>% 
  roc_curve(greatest_75_flag, .pred_Y)
## # A tibble: 785 × 3
##     .threshold specificity sensitivity
##          <dbl>       <dbl>       <dbl>
##  1 -Inf                  0       1    
##  2    0.000777           0       1    
##  3    0.00120            0       0.999
##  4    0.00121            0       0.998
##  5    0.00173            0       0.996
##  6    0.00190            0       0.995
##  7    0.00194            0       0.994
##  8    0.00203            0       0.993
##  9    0.00208            0       0.991
## 10    0.00212            0       0.990
## # ℹ 775 more rows
# Confusion matrix heatmap
augment(final_log_model,new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.494051, "N", "Y"), levels = c("N", "Y"))) %>%
 conf_mat(greatest_75_flag, .pred_class) %>%
  autoplot(type = "heatmap")

# AUC
augment(final_log_model, new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.494051, "N", "Y"), levels = c("N", "Y"))) %>%
  roc_auc(greatest_75_flag, .pred_N)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.892
# ROC plot
augment(final_log_model, new_data = nba_test) %>%
  mutate(.pred_class = factor(if_else(.pred_Y < 0.494051, "N", "Y"), levels = c("N", "Y"))) %>%
  roc_curve(greatest_75_flag, .pred_N) %>%
  autoplot()

KNN Predictions and Confusion matrix/ AUC/ AUC plot

final_knn_model_test <- augment(knn_final_fit, 
                               nba_test)
# Assess model threshold 
final_knn_model_test %>% 
  roc_curve(greatest_75_flag, .pred_Y)
## # A tibble: 12 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1 -Inf             0          1     
##  2    0             0          1     
##  3    0.00777       0.5        0.0954
##  4    0.0321        0.5        0.0880
##  5    0.0748        0.5        0.0795
##  6    0.138         0.5        0.0685
##  7    0.226         0.5        0.0599
##  8    0.342         0.625      0.0513
##  9    0.494         0.625      0.0367
## 10    0.696         0.75       0.0269
## 11    1             0.75       0.0134
## 12  Inf             1          0
# Confusion matrix heatmap
augment(knn_final_fit,new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.12788, "N", "Y"), levels = c("N", "Y")))%>% 
  conf_mat(greatest_75_flag, .pred_class) %>%
  autoplot(type = "heatmap")

# AUC
augment(knn_final_fit, new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.12788, "N", "Y"), levels = c("N", "Y"))) %>%
  roc_auc(greatest_75_flag, .pred_N)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.714
# ROC plot
augment(knn_final_fit, new_data = nba_test) %>%
  mutate(.pred_class = factor(if_else(.pred_Y < 0.12788, "N", "Y"), levels = c("N", "Y")))%>%
  roc_curve(greatest_75_flag, .pred_N) %>%
  autoplot()

ENR Predictions and Confusion matrix/ AUC/ AUC plot

enr_predictions <- predict(enr_final_fit, nba_test)

final_enr_model_test <- augment(enr_final_fit, 
                               nba_test)
# Assess model threshold 
final_enr_model_test %>% 
  roc_curve(greatest_75_flag, .pred_Y)
## # A tibble: 23 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1   -Inf               0       1    
##  2      0.129           0       1    
##  3      0.152           0       0.979
##  4      0.179           0       0.767
##  5      0.209           0       0.592
##  6      0.243           0       0.480
##  7      0.280           0       0.417
##  8      0.321           0       0.352
##  9      0.364           0       0.312
## 10      0.410           0       0.276
## # ℹ 13 more rows
# Confusion matrix heatmap
augment(enr_final_fit,new_data = nba_test) %>% 
   mutate(.pred_class = factor(if_else(.pred_Y < 0.5590, "N", "Y"), levels = c("N", "Y")))%>%
 conf_mat(greatest_75_flag, .pred_class) %>%
  autoplot(type = "heatmap")

# AUC
augment(enr_final_fit, new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.5590, "N", "Y"), levels = c("N", "Y")))%>%
  roc_auc(greatest_75_flag, .pred_N)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.907
# ROC plot
augment(enr_final_fit, new_data = nba_test) %>%
   mutate(.pred_class = factor(if_else(.pred_Y < 0.5590, "N", "Y"), levels = c("N", "Y")))%>%
  roc_curve(greatest_75_flag, .pred_N) %>%
  autoplot()

Random Forest Predictions and Confusion matrix/ AUC/ AUC plot

final_rf_model_test <- augment(final_rf_model, 
                               nba_test)

# Assess model threshold 
final_rf_model_test %>% 
  roc_curve(greatest_75_flag, .pred_Y)
## # A tibble: 265 × 3
##     .threshold specificity sensitivity
##          <dbl>       <dbl>       <dbl>
##  1 -Inf                  0       1    
##  2    0                  0       1    
##  3    0.000208           0       0.577
##  4    0.000294           0       0.559
##  5    0.000381           0       0.550
##  6    0.000502           0       0.490
##  7    0.000595           0       0.483
##  8    0.000639           0       0.482
##  9    0.000639           0       0.479
## 10    0.000647           0       0.471
## # ℹ 255 more rows
# Confusion matrix heatmap
augment(final_rf_model, new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.0428612, "N", "Y"), levels = c("N", "Y"))) %>%
  conf_mat(truth = greatest_75_flag, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

# AUC
augment(final_rf_model, new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.0428612, "N", "Y"), levels = c("N", "Y"))) %>% roc_auc(greatest_75_flag, .pred_N)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.899
# ROC curve
final_rf_model_test %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.0428612, "N", "Y"), levels = c("N", "Y"))) %>%
  roc_curve(greatest_75_flag, .pred_N) %>% 
  autoplot()

Gradient-Boosted Tree Predictions and Confusion matrix/ AUC/ AUC plot

final_bt_model_test <- augment(final_bt_model, 
                               nba_test)
# Assess model threshold 
final_bt_model_test %>% 
  roc_curve(greatest_75_flag, .pred_Y)
## # A tibble: 257 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1   -Inf               0       1    
##  2      0.488           0       1    
##  3      0.488           0       0.886
##  4      0.488           0       0.869
##  5      0.488           0       0.786
##  6      0.488           0       0.740
##  7      0.488           0       0.699
##  8      0.488           0       0.697
##  9      0.488           0       0.693
## 10      0.488           0       0.680
## # ℹ 247 more rows
# Confusion matrix heatmap
augment(final_bt_model, new_data = nba_test) %>%  
  mutate(.pred_class = factor(if_else(.pred_Y < 0.4915, "N", "Y"), levels = c("N", "Y"))) %>%
  conf_mat(truth = greatest_75_flag, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

# AUC 
augment(final_bt_model, new_data = nba_test) %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.4915, "N", "Y"), levels = c("N", "Y"))) %>% 
  roc_auc(greatest_75_flag, .pred_N)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.874
# ROC curve
final_bt_model_test %>% 
  mutate(.pred_class = factor(if_else(.pred_Y < 0.4915, "N", "Y"), levels = c("N", "Y"))) %>% 
  roc_curve(greatest_75_flag, .pred_N) %>% 
  autoplot()

Results table

The results are summarized into the following table.

results_log <- c('0.892', '5', '676')
results_knn <- c('0.718','3','776')
results_enr <- c('0.907','5','716')
results_rf <- c('0.899','7','704')
results_bt <- c('0.876','6','662')
performance_labels <- c('AUC','True Positive Count','True Negative Count')
performance_results_table <- cbind(results_log,results_knn,results_enr,results_rf,results_bt)
performance_results_table <- cbind(performance_labels,performance_results_table)

knitr::kable(performance_results_table,'pipe',col.names = c('Metric','Logistic Regression','KNN','Elastic Net Regression','Random forest','Gradient-Boosted Tree'))
Metric Logistic Regression KNN Elastic Net Regression Random forest Gradient-Boosted Tree
AUC 0.892 0.718 0.907 0.899 0.876
True Positive Count 5 3 5 7 6
True Negative Count 676 776 716 704 662

The following table is the true counts in the testing set.

# Testing Count table
results_test_Y <- c('8')
results_test_N <- c('818')
results_test <- rbind(results_test_Y,results_test_N)
rownames(results_test)= c('Greatest 75','Not Greatest 75')
knitr::kable(results_test,'pipe')
Greatest 75 8
Not Greatest 75 818

Overall, some models perform better when predicting true positives and true negatives, while others perform better by the metric of the area under the ROC curve.

For the metric of AUC, elastic net regression performs the best. Random forest and logistic regression perform slightly worse.

For predicting true positives, the random forest model performs the best, followed by the gradient-boosted tree.

For predicting true negatives, KNN works the best. ENR and random forest also perform well.

With all factors considered, random forest and ENR are the best models.

Conclusion

To recap, the goal of this project is to determine, strictly from a data science perspective, which factors determine whether an NBA player is one of the greatest of all time.

The data is first modified in order to better fit the models. Then a recipe is defined and folds are prepared for cross-validation.

Then 5 different models are fitted using the training set: logistic regression, K nearest neighbors, elastic net regression, random forest, gradient boosted tree.

All 5 models are fitted to the testing data.

The performance of each model is assessed by confusion matrix, area under the curve and the ROC plot.

Even though no model performs the best across all metrics, the random forest model performs well, if not the best, on all of them. Therefore, it is the best model for this project. ENR is the best alternative, and the gradient-boosted tree and logistic regression also perform well. Surprisingly, KNN did not perform as well compared to the other models.

To address the initial question, many factors play a role in determining whether a player can be one of the greats. The number of years he played in the NBA plays the biggest role, but other factors also carry weight. By using the statistical models presented in this project, one can predict whether a current or retired player will be regarded as one of the greatest of all time.

The next steps for this project can be experimenting with interactions among the variables and potentially fit more statistical models.

(Extra) Apply the Models to Levi

Levi is a basketball player that I know. He has played 0 years in the NBA; most days, he plays basketball at 53rd Avenue Community Park in Hillsboro, OR. Levi plays the power forward position, and he attended Portland Community College. A reasonable assumption is that Levi will never be a top-75 player of all time.

Levi
Portland Community College

To test model accuracy, the random forest and ENR models are applied to Levi to check whether they produce a “No”.

Creating a profile for Levi

# Creating a df for Levi
levi_infor <- c(177, 160, 0, 4, 'Portland Community College','USA', 0,'N')
nba_test_addlevi = rbind(nba_test, levi_infor)

nba_test_addlevi<-nba_test_addlevi %>% 
  mutate(height = as.numeric(height), weight = as.numeric(weight), season_exp = as.numeric(season_exp),draft_round = as.numeric(draft_round), greatest_75_flag = factor(greatest_75_flag))

nba_test_addlevi[nrow(nba_test_addlevi),]
## # A tibble: 1 × 8
##   height weight season_exp position school  country draft_round greatest_75_flag
##    <dbl>  <dbl>      <dbl> <chr>    <chr>   <chr>         <dbl> <fct>           
## 1    177    160          0 4        Portla… USA               0 N

Apply Random forest/ ENR

# Random Forrest Predictions
levi_rf <- predict(final_rf_model, new_data = nba_test_addlevi[nrow(nba_test_addlevi),])

# ENR Predictions 
levi_ENR <- predict(enr_final_fit, new_data = nba_test_addlevi[nrow(nba_test_addlevi),])

levi_results <- c(as.character(levi_rf$.pred_class), as.character(levi_ENR$.pred_class))
levi_results_label <- c('Random forest', 'ENR')
levi_table <- tibble(levi_results_label, levi_results)
knitr::kable(levi_table,'pipe', col.names = c('Model','Testing results'))
Model Testing results
Random forest N
ENR N

Both of the models return a “No”, which is expected.

References:

A huge thank you to Dr. Coburn for helping me with this project!

The data set is obtained from Kaggle:

Kaggle link